A Compile-Time Data Locality Optimization Framework for NUCA Chip Multiprocessors
Authors
Abstract
With increasing numbers of cores, future CMPs (Chip MultiProcessors) are likely to have a tiled architecture with a portion of shared L2 cache on each tile and a bank-interleaved distribution of the address space. For data-parallel programming models, there is a mismatch between such a non-uniform cache organization and the canonical row-major or column-major layouts of multi-dimensional arrays, causing a significant number of non-local L2 accesses for many commonly occurring data access patterns. In this paper we develop a compile-time framework for data locality optimization via data layout transformation. Using a polyhedral model for dependences, the program's localizability is determined by analysis of intra- and inter-statement dependences, followed by non-canonical data layout transformation to reduce non-local accesses for localizable computations. Simulation-based results on a 16-core 2D tiled CMP demonstrate the effectiveness of the approach.
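The layout mismatch described in the abstract can be illustrated with a small sketch. It is not the paper's framework: the 64-byte line size, 8-byte elements, line-granularity interleaving, block-of-rows work partitioning, and the `bank_aware` remapping are all assumptions chosen for illustration, and every function name is hypothetical. The sketch only counts how many array elements live in the L2 bank of the core that owns them under a canonical row-major layout versus one possible non-canonical layout.

```python
# Illustrative sketch only; all parameters and names below are assumptions,
# not taken from the paper.
LINE_BYTES = 64                      # assumed cache-line size
ELEM_BYTES = 8                       # assumed element size (e.g. a double)
NUM_BANKS = 16                       # one L2 bank per tile on a 16-tile CMP
ELEMS_PER_LINE = LINE_BYTES // ELEM_BYTES


def home_bank(elem_index):
    """Bank holding an element when lines are interleaved across banks."""
    line = (elem_index * ELEM_BYTES) // LINE_BYTES
    return line % NUM_BANKS


def row_major(i, j, n):
    """Canonical row-major linearization of an n x n array."""
    return i * n + j


def bank_aware(i, j, n):
    """Hypothetical non-canonical layout: each core's block of rows is
    scattered so that all of its cache lines map to its own bank.
    Assumes n is a multiple of NUM_BANKS and each block is line-aligned."""
    rows_per_core = n // NUM_BANKS
    owner = i // rows_per_core                   # core that computes row i
    offset = (i % rows_per_core) * n + j         # offset inside owner's block
    line, within = divmod(offset, ELEMS_PER_LINE)
    return (line * NUM_BANKS + owner) * ELEMS_PER_LINE + within


def local_fraction(layout, n):
    """Fraction of elements whose home bank is the owning core's bank."""
    rows_per_core = n // NUM_BANKS
    local = 0
    for i in range(n):
        owner = i // rows_per_core
        for j in range(n):
            if home_bank(layout(i, j, n)) == owner:
                local += 1
    return local / (n * n)


if __name__ == "__main__":
    n = 64
    print("row-major local fraction: ", local_fraction(row_major, n))
    print("bank-aware local fraction:", local_fraction(bank_aware, n))
```

Under these assumptions, with `n = 64` only 1 in 16 elements is local under the row-major layout, while the remapped layout makes every element local. A real compiler framework would also have to decide, from the dependence analysis, whether such a remapping is legal and profitable for a given computation.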
Related resources
Judicious Thread Migration When Accessing Distributed Shared Caches
Chip-multiprocessors (CMPs) have become the mainstream chip design in recent years; for scalability reasons, designs with high core counts tend towards tiled CMPs with physically distributed shared caches. This naturally leads to a Non-Uniform Cache Architecture (NUCA) design, where on-chip access latencies depend on the physical distances between requesting cores and home cores where the data i...
Adaptive Zone-Aware Multi-bank on Chip last level L2 Cache Partitioning for Chip Multiprocessors
This paper proposes a novel efficient Non-Uniform Cache Architecture (NUCA) scheme for the Last-Level Cache (LLC) to reduce the average on-chip access latency and improve core isolation in Chip Multiprocessors (CMP). The architecture proposed is expected to improve upon the various NUCA schemes proposed so far such as S-NUCA, D-NUCA and SP-NUCA[9][10][5] in terms of average access latency witho...
3D Tree Cache – A Novel Approach to Non-Uniform Access Latency Cache Architectures for 3D CMPs
We consider a non-uniform access latency cache architecture (NUCA) design for 3D chip multiprocessors (CMPs) where cache structures are divided into small banks interconnected by a network-on-chip (NoC). In earlier NUCA designs, data is placed in banks either statically (S-NUCA) or dynamically (D-NUCA). In both S-NUCA and D-NUCA designs, scaling to hundreds of cores can pose several challenges. ...
Analysis of Non-Uniform Cache Architecture Policies for Chip-Multiprocessors Using the Parsec Benchmark Suite
Non-Uniform Cache Architectures (NUCA) have been proposed as a solution to overcome wire delays that will dominate on-chip latencies in Chip Multiprocessor designs in the near future. This novel means of organization divides the total memory area into a set of banks that provides non-uniform access latencies and thus faster access to those banks that are close to the processor. A NUCA model can ...
Performance Analysis of Non-Uniform Cache Architecture Policies for Chip-Multiprocessors Using the Parsec v2.0 Benchmark Suite
Non-Uniform Cache Architectures (NUCA) have been proposed as a solution to overcome wire delays that will dominate on-chip latencies in Chip Multiprocessor designs in the near future. This novel means of organization divides the total memory area into a set of banks that provides non-uniform access latencies and thus faster access to those banks that are close to the processor. A NUCA model can...